The Shiny app will be self-contained. Rather than cluttering the Shiny app, this Rmarkdown document has been created for further exploration into the data.

Some findings could invert over time. I tried to make write-up as flexible as possible with inline code chunks, but–as more data gets added–the results from these tests will change. To see the results from when these tests were first performed, change the input to read_csv() so that it reads in the csv file found in the “Analysis” folder.

The Dataset

The La Junta dataset is quickly becoming one of my favorites. I suppose there is something to be said about collecting, cleaning, and analyzing data all by yourself.

if(!require(pacman)) install.packages("pacman")

# General Packages
pacman::p_load(dplyr, readr, ggplot2,
               patchwork, lubridate,
               knitr, broom, stringr,
               plotly)

# KNN Packages
pacman::p_load(rsample, class)

# Logistic Regression Packages
pacman::p_load(pscl, caret, car,
               InformationValue)

# LDA Packages
pacman::p_load(MASS)

# SVM Packages
pacman::p_load(e1071)

# The raw data for La Junta, which is on GitHub
# (as opposed to the local file):
lajunta <- read_csv(paste0("https://raw.githubusercontent.com",
                           "/Ckrenzer/Winter-Livestock-Data/main",
                           "/La%20Junta%20Market%20Reports.csv"))

# Know that I only used paste0() on the url
# to avoid getting into that dark LaTex magic
# for pdf output!

To familiarize you with the dataset, let’s take a quick peek:

Date Quantity Type Weight Price Reprod
10-20-2020 18 clr 430 179 str
10-20-2020 49 clr 515 146 str
10-20-2020 23 clr 405 165 hfr
10-20-2020 42 clr 495 130 hfr
10-20-2020 10 mix 430 174 str
10-20-2020 15 ang 465 172 str
10-20-2020 15 ang 435 170 str
10-20-2020 58 ang 495 146 str

You can see that we have data for the date of the sale, the quantity sold, the type of livestock, the weight in pounds, the price in USD, and the livestock’s reproductive status. There is also data on Buyer names, but I thought it would be rude to include that information in this table. It hasn’t proved very useful anyway.

From my previous work (see “La Junta Predictions.Rmd,” for instance), I can confidently say that the “Type” column is utterly useless. That’s a story in and of itself, I suppose. The most telling graph I’ll provide for you is this one:

The dark theme on ggplot looks really nice, don’t you think? We can clearly see four distinct groups and will explore these relationships in various ways.

KNN Clustering

Since the data clusters so nicely, I could not resist checking KNN’s performance!

Step 1: Packages

I will be using the rsample package to split the data into testing and training sets. The class package will be used for the KNN algorithm. With this algorithm, I will be classifying cattle reproductive status (cow, bull, steer, or heifer) using weight and market price.

Step 2: Prepare Data

Lucky for me, I already spent a full day cleaning this data! All I have to do is select relevant columns and make the character column a factor.

There are only two predictors in this model, weight and price.

# Fortunately, KNN is well-suited for just a few predictors
lajunta_knn <- lajunta %>% 
  dplyr::select(Weight, Price, Reprod)

# storing the string vector as a factor
lajunta_knn$Reprod <- as.factor(lajunta_knn$Reprod)

Step 3: Standardize Data

Since KNN is sensitive to different scales, so I will standardize the values of the predictors.

# The normalizing function, which transforms
# all values into the range from 0 to 1
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x))) 
}


# Applying normalize() on the predictor
# columns
lajunta_knn <- lajunta_knn %>% 
  mutate(across(where(is.numeric), normalize))

Step 4: Separate into Training and Test Sets

A stratified, simple random sampling technique will be used to separate values into training and testing data. Stratification is necessary in this instance because there are far fewer cows and bulls in the dataset than there are heifers or steers. Eighty percent of the values will be placed in the training set and the remaining twenty percent will go to the test set.

# Setting the random number stream
set.seed(2021)
lajunta_knn_split <- rsample::initial_split(lajunta_knn,
                                            prob = 0.80,
                                            strata = Reprod)

# As of 2/6/2021, the split is 869 in train
# and 289 in test, with a total of 1158 observations

# separating the data into two new data frames
lajunta_knn_train <- rsample::training(lajunta_knn_split)
lajunta_knn_test <- rsample::testing(lajunta_knn_split)

Step 5: Build Model

To determine the appropriate number for K, I took the square root of the number of observations in the training set.

# Determining the appropriate value of K
optimal_k <- floor(sqrt(nrow(lajunta_knn_train)))

cat("Optimal value of K:", optimal_k)
## Optimal value of K: 29
# To explain the variable names, know that the
# value of optimal_k is 29 (and some change)
# at the time of writing

# the cl argument is the factor of the true
# classifications from the training set
knn_29 <- class::knn(train = lajunta_knn_train[1:2],
                     test = lajunta_knn_test[1:2],
                     cl = lajunta_knn_train$Reprod,
                     k = optimal_k)

Step 6: Evaluate

To calculate the accuracy, the data was thrown into a confusion matrix and the number of correct classifications were calculated as a rate. Though not shown, all cows and bulls are predicted perfectly. The only data which are not classified correctly are steers and heifers.

# Create a confusion matrix--The result from knn()
# and the test data dependent variable are the
# arguments
knn_29_mat <- table(knn_29, lajunta_knn_test$Reprod)


#This function measures accuracy. It takes the confusion matrix as input
accuracy <- function(x){
  return(sum(diag(x)/(sum(rowSums(x)))) * 100)
}


cat("Accuracy", accuracy(knn_29_mat)) # 87.2% as of 2/6/2021
## Accuracy 87.19723

Step 7: Optimize

We can always do better, no? Let’s run this again in a loop with changing values of K. The K with the greatest accuracy should be used for all future predictions. Also, this value should be somewhat close to 29–if we get a number very far from 29, it is likely that we are overfitting the model. That pesky bias-variance trade-off…

# Since I am using a loop, best practice
# is to pre-allocate space in the vector
acc <- numeric(100)
for(i in 1:100){
  knn_mod <- class::knn(train = lajunta_knn_train[1:2],
                        test = lajunta_knn_test[1:2],
                        cl = lajunta_knn_train$Reprod,
                        k = i)
  acc[i] <- accuracy(table(knn_mod, lajunta_knn_test$Reprod))
  
}

The ideal value was 30–very close to 29! Using just two variables, we can predict the reproductive status of livestock with 87.5432526 percent accuracy! As previously stated, most of these discrepancies are between steers and heifers. If we are only interested in cows and bulls, we will virtually always have the right prediction!

Linear Regression

The data appears linear, so linear models seem appropriate for this dataset. There is a linear model for each reproductive status. You can see each model’s parameters below:

# All results are from 2/6/2021...


steer_fit <- lm(data = lajunta,
                subset = Reprod == "str",
                formula = Price ~ I(lubridate::mdy(Date)) +
                  Quantity + poly(Weight, degree = 2))


heifer_fit <- lm(data = lajunta,
                 subset = Reprod == "hfr",
                 formula = Price ~ I(lubridate::mdy(Date)) +
                   Quantity + poly(Weight, degree = 2))


# a 12th degree polynomial gave a higher
# R-squared, but I think that is due to overfitting
cow_fit <- lm(data = lajunta,
              subset = Reprod == "cow",
              formula = Price ~ I(lubridate::mdy(Date)) +
                poly(Weight, 3))


# a 6th degree polynomial gave a higher
# R-squared, but I think that is due to overfitting...
# Our sample is small so we want to keep the
# number of predictors small
bull_fit <- lm(data = lajunta,
               subset = Reprod == "bull",
               formula = Price ~ I(lubridate::mdy(Date)) +
                 poly(Weight, 2))

Now, let’s see how these models performed:

Steer Model Evaluation
term estimate std.error statistic p.value
(Intercept) -2865.5523759 172.5392151 -16.608122 0.0000000
I(lubridate::mdy(Date)) 0.1628263 0.0092623 17.579370 0.0000000
Quantity 0.0513613 0.0132025 3.890285 0.0001114
poly(Weight, degree = 2)1 921.8923002 124.9903964 7.375705 0.0000000
poly(Weight, degree = 2)2 1416.4769009 70.7484200 20.021322 0.0000000
## R-squared: 0.8933275 
## Ajdusted R-squared: 0.8926116 
## RSE: 7.199308
Heifer Model Evaluation
term estimate std.error statistic p.value
(Intercept) -1698.3880997 185.2321464 -9.168971 0.0000000
I(lubridate::mdy(Date)) 0.0989911 0.0099615 9.937379 0.0000000
Quantity 0.0383901 0.0123309 3.113323 0.0019682
poly(Weight, degree = 2)1 670.0592725 126.3401778 5.303612 0.0000002
poly(Weight, degree = 2)2 916.0706260 72.8931789 12.567303 0.0000000
## R-squared: 0.7312624 
## Ajdusted R-squared: 0.7288683 
## RSE: 6.515564
Cow Model Evaluation
term estimate std.error statistic p.value
(Intercept) -1502.7449038 243.6783475 -6.166920 0.0000000
I(lubridate::mdy(Date)) 0.0797198 0.0127782 6.238761 0.0000000
poly(Weight, 3)1 1314.3596519 393.0916724 3.343647 0.0013149
poly(Weight, 3)2 -618.0623086 206.0714927 -2.999262 0.0037145
poly(Weight, 3)3 475.6299170 144.6912620 3.287205 0.0015663
## R-squared: 0.4205815 
## Ajdusted R-squared: 0.3883916 
## RSE: 3.036778
Bull Model Evaluation
term estimate std.error statistic p.value
(Intercept) -2347.5276006 317.8523108 -7.385592 0.0000002
I(lubridate::mdy(Date)) 0.1269014 0.0170642 7.436691 0.0000002
poly(Weight, 2)1 710.2477722 240.2674682 2.956071 0.0072998
poly(Weight, 2)2 -203.7221274 72.9827192 -2.791375 0.0106419
## R-squared: 0.7564061 
## Ajdusted R-squared: 0.7231888 
## RSE: 2.431339

Linear regression is an art as much as it is a science. I could write a 30 page paper explaining all the EDA, providing a play-by-play commentary about potential improvements to this model. But today I think I’ll leave things as they are presented here.

See the accompanying Shiny app (“La Junta Predictions.Rmd,” also linked above) if you’d like to play with these linear models yourself!

Logistic Regression

The write-up uses data as of 2/6/2021. The code will still work, though some of my discussion may be inconsistent with the current data, as the csv gets updated over time.

There are a few things I want to try with a logistic regression. Though logistic regression can handle more than two categories, it is best-suited for binary dependent variables. Therefore, I will be evaluating three models with binary outcomes:

  1. Determining whether livestock is a steer or a heifer.
  2. Determining whether livestock is a bull or a cow.
  3. Determining whether livestock is either a bull/cow or steer/heifer.

Due to enormous differences between cows/bulls and heifers/steers, I expect zero error in the third regression.

Setup and Cleaning

First things first. In the first regression, only data for steers and heifers will be included. In the second regression, only data for bulls and cows will be included. In the third regression, steers and heifers will be recoded as “y,” for “young,” while cows and bulls will be recoded as “par,” for “parents” (and yes, I understand that bulls are not necessarily fathers; I had trouble coming up with a good name for these categories). With this in mind, let’s store data for the models:

# Data for regression 1
lajunta_logit1 <- lajunta %>% 
  filter(Reprod %in% c("str", "hfr")) %>% 
  mutate(across(where(is.character), as.factor))


# Data for regression 2
lajunta_logit2 <- lajunta %>% 
  filter(Reprod %in% c("bull", "cow")) %>% 
  mutate(across(where(is.character), as.factor))

# Data for regression 3
lajunta_logit3 <- lajunta %>% 
  mutate(Reprod = lajunta[["Reprod"]] %>% 
           str_replace_all("bull|cow", "par") %>% 
           str_replace_all("str|hfr", "y")) %>% 
  mutate(across(where(is.character), as.factor))

Now that we have everything in a tibble, we are ready to split up the data.

Split into Training and Test Sets

The data will be split into testing and training sets with an 80/20 split. For the same reasons used in the KNN classification section, stratified sampling will be used.

# REGRESSION 1 (steer vs heifer)
# Setting the random number stream
set.seed(2021)
# Stratification for the first regression
lajunta_logit1_split <- rsample::initial_split(lajunta_logit1,
                                               prob = 0.80,
                                               strata = Reprod)
# separating the data into two new data frames
lajunta_logit1_train <- rsample::training(lajunta_logit1_split)
lajunta_logit1_test <- rsample::testing(lajunta_logit1_split)



# REGRESSION 2 (bull vs cow)
# Setting the random number stream again
# (in the event R tries to pull some
# funny business)
set.seed(2021)
# Stratification for the first regression
lajunta_logit2_split <- rsample::initial_split(lajunta_logit2,
                                               prob = 0.80,
                                               strata = Reprod)
# separating the data into two new data frames
lajunta_logit2_train <- rsample::training(lajunta_logit2_split)
lajunta_logit2_test <- rsample::testing(lajunta_logit2_split)





# REGRESSION 3 (bull/cow vs. steer/heifer)
# Setting the random number stream again
# (in the event R tries to pull some
# funny business)
set.seed(2021)
# Stratification for the first regression
lajunta_logit3_split <- rsample::initial_split(lajunta_logit3,
                                               prob = 0.80,
                                               strata = Reprod)
# separating the data into two new data frames
lajunta_logit3_train <- rsample::training(lajunta_logit3_split)
lajunta_logit3_test <- rsample::testing(lajunta_logit3_split)

Models

The data is now ready for model assembly. I will be using the stats package’s glm() function as the engine for the regressions.

# REGRESSION 1 (steer vs heifer)
# Quantity is not very different between steers and heifers
logit1 <- glm(data = lajunta_logit1_train,
              formula = Reprod ~ Weight + Price,
              family = "binomial")

# REGRESSION 2 (bull vs cow)
logit2 <- glm(data = lajunta_logit2_train,
              formula = Reprod ~ Weight + Price + Quantity,
              family = "binomial")

# REGRESSSION 3 (bull/cow vs. steer/heifer)
logit3 <- glm(data = lajunta_logit3_train,
              formula = Reprod ~ Weight + Price + Quantity,
              family = "binomial")

Here are the results from the first regression (steers vs heifers):

## 
## Call:
## glm(formula = Reprod ~ Weight + Price, family = "binomial", data = lajunta_logit1_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.92512  -0.53865   0.07971   0.50108   2.56263  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -43.151830   2.996368  -14.40   <2e-16 ***
## Weight        0.022718   0.001756   12.94   <2e-16 ***
## Price         0.206868   0.014514   14.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1082.62  on 791  degrees of freedom
## Residual deviance:  555.51  on 789  degrees of freedom
## AIC: 561.51
## 
## Number of Fisher Scoring iterations: 6

From the second regression (bull vs cow):

## 
## Call:
## glm(formula = Reprod ~ Weight + Price + Quantity, family = "binomial", 
##     data = lajunta_logit2_train)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.168e-05  -1.100e-08   2.100e-08   2.100e-08   3.326e-05  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  2.555e+02  3.333e+05   0.001    0.999
## Weight      -3.964e-02  2.872e+02   0.000    1.000
## Price       -2.486e+00  6.909e+03   0.000    1.000
## Quantity    -1.647e+00  1.758e+04   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.8806e+01  on 77  degrees of freedom
## Residual deviance: 1.8179e-09  on 74  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25

And the third regression (bull/cow vs. steer/heifer):

## 
## Call:
## glm(formula = Reprod ~ Weight + Price + Quantity, family = "binomial", 
##     data = lajunta_logit3_train)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -6.329e-05   2.100e-08   2.100e-08   2.100e-08   9.029e-05  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.621e+01  5.892e+04   0.000    1.000
## Weight      -2.885e-02  3.705e+01  -0.001    0.999
## Price        5.788e-01  3.953e+02   0.001    0.999
## Quantity     2.519e-01  1.629e+03   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5.2484e+02  on 868  degrees of freedom
## Residual deviance: 2.1086e-08  on 865  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25

The predictors in the first regression did a pretty good job and, judging by the p-values, the predictors in the second and third regressions did an abysmal job!

The correct interpretation of the Price variable in the first regression, however, is that a one dollar increase in the price of the livestock is associated with an average increase of 0.2068676 in the log odds of being a steer.

Well, this is quite strange. Let’s continue on to the evaluation and see if we can figure out what’s happening.

Evaluation

McFadden’s R-square

We can use the pR2() function from the pscl package to compute McFadden’s R-square value, which ranges from 0 to just under 1. Values close to 0 indicate the model as no predictive power. Values over 0.4 suggest the model is a good fit for the data.

Let’s begin with the first regression:

## fitting null model for pseudo-r2
## 0.4868792

As expected, the predictors in the first regression do a pretty good job. Let’s see the McFadden statistic from the second regression:

## fitting null model for pseudo-r2
## 1

Ah! I see now. The model acts strangely because it is deterministic! I expect this to be the case for the third regression as well:

## fitting null model for pseudo-r2
## 1

Yup.

Variable Importance

We can compute the importance of each predictor variable in the model by using the varImp() function from the caret package. Higher values indicate more importance. These values can help confirm the validity of p-values.

The first regression:

Overall
Weight 12.93487
Price 14.25307

The values are both pretty big, so both of the predictors seem to be ‘carrying their weight.’ This is consistent with our analysis thus far.

What about the second regression?

Overall
Weight 0.0001380
Price 0.0003598
Quantity 0.0000937

That is certainly consistent with the p-values. The third regression?

Overall
Weight 0.0007788
Price 0.0014643
Quantity 0.0001546

Quite low, quite low.

Variance Inflation Factor

We can calculate the VIF of each variable to see if multicollinearity is a problem. VIF scores greater than 5 indicate multicollinearity is an issue that needs to be addressed. Since there are so few variables, multicollinearity is unlikely to have much effect on the results.

The first regression:

##   Weight    Price 
## 2.964014 2.964014

What a relief! Multicollinearity, as measured by VIF, is not a big problem in the data.

The second regression:

##   Weight    Price Quantity 
## 5.236167 5.687037 1.982861

Eh, this could be a problem, but they are still pretty close to five. I opt to willfully ignore the problem. Not shown here, but I already tried transforming the variables to no avail.

The third regression:

##   Weight    Price Quantity 
## 3.182178 1.993148 4.722170

Good. These scores are not very worrying to me.

Test Data

Finally, we can evaluate model performance by using the test dataset. By default, any observation in the test dataset with a probability greater than 0.5 will be predicted to be a steer, cow, or heifer/steer (a “y”), respectively (for regressions 1, 2, and 3). We can calculate these probabilities with this code:

# REGRESSION 1 (steer vs heifer)
# calculating the probability of being a steer
# and storing the results in a vector
logit1_test_probs <- predict(logit1,
                             lajunta_logit1_test,
                             type = "response")

# REGRESSION 2 (bull vs cow)
# calculating the probability of being a cow
# and storing the results in a vector
logit2_test_probs <- predict(logit2,
                             lajunta_logit2_test,
                             type = "response")

# REGRESSSION 3 (bull/cow vs. steer/heifer)
# calculating the probability of being a
# heifer/steer (a "y") and storing the
# results in a vector
logit3_test_probs <- predict(logit3,
                             lajunta_logit3_test,
                             type = "response")

Thankfully, mathematical wizards around the world have come together to develop a magical way of determining the optimal probability that maximizes accuracy of the model. The optimalCutoff() function from the InformationValue package finds this value.

Since regressions 2 and 3 are deterministic, this process will only be done on the first regression. One problem with this approach, I suppose you could argue, is that it risks overfitting the model. You can decide for yourself what you think.

# converting "str" to 1's and "hfr" to 0's
# This is needed for the optimalCutoff() call
# below
lajunta_logit1_test$Reprod <- ifelse(lajunta_logit1_test$Reprod == "str", 1, 0)

# finding optimal cutoff probability to use to maximize accuracy
logit1_optimal <- optimalCutoff(lajunta_logit1_test$Reprod,
                                logit1_test_probs)[1]

# print the optimal probability
cat(logit1_optimal)
## 0.5399724

The optimal probability is 0.5399724. With this new threshold (the old threshold was 0.5, remember), any livestock with a probability of being a steer at or above 0.5399724 will be classified as a steer, with the rest being classified as heifers.

Confusion Matrix

Using the optimal threshold, we can create a confusion matrix showing how the model performed on the test data.

As always, let’s begin with the first regression:

##   hfr str
## 0  90  20
## 1  23 130

The zeroes are heifers and the ones are steers. As we can see, 23 heifers were incorrectly identified as steers and 20 steers were incorrectly identified as heifers. Cool!

Let’s repeat the process for the second regression:

##   bull cow
## 0    6   0
## 1    0  19

The model is deterministic. Zero misclassifications!

I expect the same for the third regression:

##   par   y
## 0  25   0
## 1   0 264

Deterministic indeed.

ROC Curves

We can also calculate the sensitivity (the “true positive rate”), specificity (the “true negative rate”), and the total number of misclassifications using the numbers from the confusion matrix. I’ll spare you the details on the sensitivity and specificity and cut straight to the error rate. Further, since we already know regressions 2 and 3 are deterministic, I won’t waste anyone’s precious time showing that the error rate is 0 percent in these two models.

The misclassification error rate for the first regression at the optimal threshold:

## 0.1521

The data were classified incorrectly on the test data 15.21 percent of the time.

Finally, we can put together ROC curves to visualize how well the models fit. The greater the AUC (area under the curve), the more accurately the model predicts.

For the first regression

The first regression seems to be a reasonably good fit, as measured by the ROC curve.

The second curve? That’s just a gigantic blue rectangle. The same is true of the ROC curve for the third regression. My intuition at the outset of this section was correct. There was no error in the third regression.

Linear Discriminant Analysis (LDA)

While we are in the neighborhood of classification models, why don’t we give LDA a go? As we all know by this point, there are four categories to be interested in classifying: bull, cow, steer, and heifer. Also, as you may have guessed, the predictors are price, weight, and quantity. Date is a strange variable, and Type has too little predictive power to bother including.

Packages

The engine for the LDA model is the one provided in the MASS package. As usual, the sampling will be done with the rsample package–in the same manner as the previous classifications.

Scale Data

Since one of the key assumptions of LDA is that each predictor has the same variance, they should all be scaled to give the data a mean of zero and a standard deviation of one. We can do this with a mutate(across()) call:

lajunta_lda <- lajunta %>% 
  mutate(across(where(is.numeric), scale),
         Date = lubridate::mdy(Date))

And we can confirm that the standard deviation is, indeed, 1 after transformation:

lajunta_lda %>% 
  summarize(across(where(is.numeric), sd)) %>% 
  unlist()
## Quantity   Weight    Price 
##        1        1        1

Excellent. We are now ready to separate into training and test samples.

Samples

Due to imbalances in the number of observations in reproductive status of the livestock, stratified sampling will (again) be used. Okay, let’s make the 80/20 split:

# Setting the random number stream
set.seed(2021)
# Stratification for the
lajunta_lda_split <- rsample::initial_split(lajunta_lda,
                                               prob = 0.80,
                                               strata = Reprod)

# separating the data into two new data frames
lajunta_lda_train <- rsample::training(lajunta_lda_split)
lajunta_lda_test <- rsample::testing(lajunta_lda_split)

Build Model

Now, let’s fit the model and see the results:

lajunta_lda_fit <- MASS::lda(Reprod ~ Price + Weight + Quantity,
                       data = lajunta_lda_train)

lajunta_lda_fit
## Call:
## lda(Reprod ~ Price + Weight + Quantity, data = lajunta_lda_train)
## 
## Prior probabilities of groups:
##       bull        cow        hfr        str 
## 0.02301496 0.06559264 0.39470656 0.51668585 
## 
## Group means:
##           Price     Weight    Quantity
## bull -1.8095083  4.4046724 -1.06414298
## cow  -2.6963960  2.1141854 -0.89919204
## hfr  -0.1000010 -0.2776876  0.06849784
## str   0.4875391 -0.2601619  0.11887201
## 
## Coefficients of linear discriminants:
##                 LD1       LD2        LD3
## Price    -0.5380345 2.6456972  0.1617400
## Weight   -2.8054740 1.9466759 -0.1821785
## Quantity  0.3467151 0.2264476 -0.9590007
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7665 0.2334 0.0001

Interpretation

Let’s break this output down bit by bit…

Prior Probabilities of Groups

These represent the proportions of each reproductive status of the livestock in the training set. Bulls, for example, make up 2.3 percent of the observations in the training dataset.

Group Means

The group means are the mean values for each predictor in the model, broken down by the possible outcomes for the dependent variable.

Coefficients of Linear Discriminants

These coefficients represent the linear combination of predictor variables that are used to form the decision rule of the model. For example,

LD1 = -(0.5380345)(Price) - (2.805474)(Weight) + (0.3467151)(Quantity)

Proportion of Trace

Displays the percentage separation achieved by each linear discriminant function (Ex. if LD1 was 0.73, we would say that “LD1 is responsible for 73% of the separation between the groups”).

Evaluation

Following our prepare-split-train process, we have now come to the step where we analyze the test data.

lajunta_lda_probs <- predict(lajunta_lda_fit,
                             lajunta_lda_test)

There are three elements of a list from this prediction–the predicted class, posterior probability than an observation belongs to each class (the probability of the observation being a steer, heifer, bull, or cow–with these four probabilities summing up to 1), and the linear discriminants.

Let’s see the first few posterior probabilities in the test set:

bull cow hfr str
0 0 0.4364847 0.5635153
0 0 0.7021184 0.2978816
0 0 0.9482221 0.0517779
0 0 0.6879388 0.3120612
0 0 0.9676706 0.0323294
0 0 0.7826029 0.2173971
0 0 0.9514504 0.0485496
0 0 0.9755308 0.0244692

I imagine that I am losing you by this point, because the outcome for these classification problems all approximate the same thing (that the graph in the introduction shows plain as day). I think it is good to use several approaches to confirm results, however, because that only strengthens our understanding of the interaction between the variables.

With that being said, observation of the above table once again confirms that the difference between heifers and steers is far more ambiguous than differences between any other cattle reproductive statuses.

The question I think you are likely most interested in, “Does LDA perform better than KNN or any of the logistic regressions,” can be answered easily by comparing the model’s classifications of the test data to the actual observations in the test set.

# For you non-R programmers out there,
# R mathematical functions translate
# TRUE's into 1's and FALSE's into 0's.
# In other words, sum(TRUE, TRUE, FALSE) returns 2.
mean(lajunta_lda_probs$class == lajunta_lda_test$Reprod)

Drum roll please! The model correctly identified the true reproductive status 0.8477509 percent of the time! Well, at least now we know that KNN is a better predictor if we want to classify reproductive status in the future! Regardless, they are both reasonably close to one another, so KNN is probably a better choice.

Often times, the people we need to convey insights to have no background in statistics; Occam’s Razor means something approaching, “the simpler explanation is usually the most correct.” Chances are that we will have a much easier time explaining KNN than we would describing linear discriminant analysis. If a simpler method does a “good-enough” job, choose the simpler method.

Visualization of Separation

Finally, we can visualize how well LDA separated the different reproductive statuses of cattle:

lajunta_lda_plot <- cbind(lajunta_lda_train,
                    predict(lajunta_lda_fit)$x)

#create plot
ggplot(lajunta_lda_plot, aes(LD1, LD2)) +
  geom_point(aes(color = Reprod)) +
  ggtitle("Linear Discriminant 1 vs Linear Discriminant 2") +
  theme_dark()

We could make similar plots using LD3 as well, but LD1 and LD2 are by far the most important linear discriminants. There is no real justification for including LD3.

For those of you looking to improve this model, try transforming some of the predictors!

Support Vector Machines (SVM)

I am in the mood to keep going with these classification models. I want to get a model that predicts with 95 percent accuracy, but we haven’t met that threshold with any of the preceding models. This model is similar to the logistic regression model, but instead of filtering data down to just steers and heifers, it includes cows and bulls as well. Only Weight and Price will be used

The question this model answers: “is the livestock a steer?”

Packages

The e1071 package will be the engine used to implement the algorithm.

Sampling

Computing Support Vectors

Evaluation

Hyperplane Plot

The algorithm is supposed to classify observations by creating a hyperplane with the maximum possible margins (think of margins as the “buffer zone” between the points and the plane–we want the largest “buffer zone” possible).

plotly::plot_ly(x = lajunta$Weight,
                y = lajunta$Price,
                z = lajunta$Quantity,
                color = lajunta$Reprod) %>% 
  layout(title = "Livestock by Weight, Price, and Quantity")
  ggplot(lajunta) +
  geom_point(mapping = aes(x = Quantity, y = Price, color = Reprod))

Test Set

In this section, you will predict the new values on the test set, then compare them to the actual, giving you the model’s accuracy.

Then, you will summarize findings and compare it to the other models in this file.

Thank You!

If you’ve made it this far–you’re a trooper! I hope you enjoyed going through this as much as I did! If you have any questions or want to provide feedback, do not hesitate to reach out. Take care now!